Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

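To count the total number of bases in the reference file, you can combine the “grep”, “wc”, and “awk” commands as follows. The “wc” command reports the number of lines, words, and bytes, so subtracting the line count from the byte count removes the newline characters and leaves the number of bases: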

grep -v ">" GRCh38.p13_ref.fna | wc | awk '{print $3-$1}'
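The counting logic can be checked on a small file before running it on the whole genome. The following is a minimal sketch, not part of the original workflow: the helper name count_bases and the toy file are assumptions made for illustration. It removes the line terminators explicitly with “tr”, which also covers files with Windows-style line endings, and compares the result with the “wc”/“awk” approach.

```shell
# count_bases is a hypothetical helper name used only for this illustration.
count_bases() {
    # Drop header lines, delete line terminators (LF and CR),
    # then count the remaining characters.
    grep -v '>' "$1" | tr -d '\r\n' | wc -c | awk '{print $1}'
}

# Toy FASTA file with two records: 8 + 4 = 12 bases.
toy=$(mktemp)
printf '>seq1 test\nACGT\nACGT\n>seq2\nGGCC\n' > "$toy"

# Both approaches should report the same number for this file.
count_bases "$toy"
grep -v '>' "$toy" | wc | awk '{print $3-$1}'
```

On the toy file both commands print 12. The two methods can disagree on real data if the file contains carriage returns or if the last line lacks a trailing newline.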

If, for any reason, you want to split the reference sequences into separate files, you can use the following script, which creates the directory “chromosomes” and then splits the main FASTA file into one FASTA file per sequence:

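mkdir chromosomes
cd chromosomes
# Split at every header line (">"); -s suppresses the size report and
# -z removes empty output files (-z and '{*}' are GNU csplit features).
csplit -s -z ../GRCh38.p13_ref.fna '/>/' '{*}'
# Rename each piece (xx00, xx01, ...) after the first word of its header line.
for i in xx*; do
    n=$(sed 's/>// ; s/ .*// ; 1q' "$i")
    mv "$i" "$n.fa"
done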
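As a sanity check, the number of files produced by the split should equal the number of header lines in the input. The sketch below repeats the same steps on a toy FASTA file in a scratch directory; the record names chrA and chrB and the file name toy.fna are made up for illustration, and GNU csplit is assumed.

```shell
# Scratch directory so that no real data is touched.
workdir=$(mktemp -d)

# Toy FASTA file with two records (hypothetical names chrA and chrB).
printf '>chrA first toy record\nACGT\nACGT\n>chrB second toy record\nGGCC\n' \
    > "$workdir/toy.fna"

# Run the split in a subshell so the current directory is left unchanged.
(
    cd "$workdir" || exit 1
    mkdir chromosomes
    cd chromosomes || exit 1
    csplit -s -z ../toy.fna '/>/' '{*}'
    for i in xx*; do
        n=$(sed 's/>// ; s/ .*// ; 1q' "$i")
        mv "$i" "$n.fa"
    done
)

# Compare the number of header lines with the number of output files.
n_headers=$(grep -c '>' "$workdir/toy.fna")
n_files=$(ls "$workdir"/chromosomes/*.fa | wc -l)
echo "headers: $n_headers, files: $n_files"
```

The same comparison can be applied to the real reference by replacing the toy file with GRCh38.p13_ref.fna. A mismatch would suggest that two headers share the same first word, so one output file overwrote another.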

The annotation files for a reference genome may also be needed for some of the steps in the downstream analysis. You can download the annotation file as described above. An annotation file describes where genetic elements (also called features), such as genes, introns, and exons, are located in the genome sequence, giving the start and end coordinates and the name of each feature. Annotation files are usually in GFF or GTF format. GFF (General Feature Format) is a simple tab-delimited text format for describing genomic features and mapping them to the reference sequence in the FASTA file. GTF (Gene Transfer Format) is similar to GFF but has additional elements. Figure 2.3 shows the first part of the human annotation file in GFF format. An annotation file gives the chromosome name or chromosome GenBank accession in the first column, and the features and other annotations in the remaining columns.
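Because the format is tab-delimited, standard command-line tools are enough for a first look at an annotation file. The sketch below assumes the nine-column GFF3 layout (sequence ID, source, type, start, end, score, strand, phase, attributes). The file toy.gff and its coordinates and IDs are invented for illustration and are not taken from the real human annotation.

```shell
# Build a toy GFF3 file; the coordinates and IDs are made up for illustration.
gff=$(mktemp)
printf '##gff-version 3\n' > "$gff"
printf 'NC_000001.11\tRefSeq\tgene\t100\t900\t.\t+\t.\tID=gene1\n' >> "$gff"
printf 'NC_000001.11\tRefSeq\texon\t100\t300\t.\t+\t.\tID=exon1\n' >> "$gff"
printf 'NC_000001.11\tRefSeq\texon\t600\t900\t.\t+\t.\tID=exon2\n' >> "$gff"

# Count the features by type (column 3), skipping comment lines.
grep -v '^#' "$gff" | cut -f3 | sort | uniq -c

# Print the sequence ID, type, start, and end of the exon records only.
awk -F'\t' '!/^#/ && $3 == "exon" {print $1, $3, $4, $5}' "$gff"
```

The same two commands can be pointed at a downloaded annotation file to see which feature types it contains and where a given type is located.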

Both the FASTA reference file and (sometimes) its annotation file are required by alignment programs, called aligners for short, for mapping the reads in the FASTQ files to

FIGURE 2.2 Part of the FASTA sequence of the human reference genome.